Abstract Body


Background / Context: 

Randomized experiments are commonly used to evaluate the effectiveness of educational
interventions. The main focus in randomized experiments is often on the average treatment effect
across all participants in the study, yet when the effectiveness of an intervention varies, a single
summary effect may be of limited utility. Instead, understanding what works for whom, when,
and where matters. The questions of this conference - regarding the optimal age for an
intervention and the effect of the intervention on outcomes at different time points - are
inherently questions of this type. Questions regarding moderation can be addressed in several
different ways: 1) through the inclusion of multiple cohorts of students (e.g., K and 3rd graders)
or through longitudinal designs (e.g., outcomes at 1, 2, and 3 years) within individual
experiments; 2) through the accumulation of evidence across studies, synthesized using
meta-analysis. This paper focuses on this second approach, which we argue is particularly important
because individual studies are rarely powered adequately to detect treatment effect interactions.

Over the past 30 years, meta-analysis has been widely used in education research. A
recent innovation in meta-analysis is the introduction of a robust variance estimator (RVE) that
allows for the inclusion of multiple, correlated effect sizes in a meta-analysis (Hedges, Tipton,
and Johnson, 2010); to date, this method has been used in over 50 meta-analyses in fields as
diverse as ecology, education, psychology, and intervention studies. An advantageous feature of
RVE is that it does not require information on the true correlation structure of the estimates
within a given study, which is rarely reported in practice.

The statistical theory behind the robust variance estimation method is asymptotic; in 
large-enough samples, it has been shown to be an unbiased estimator of the true sampling 
variance. In small samples, however, the estimator can be biased and the Type I error rate of tests 
based upon the RVE method can be much too liberal (Hedges et al., 2010; Tipton, 2013). This
represents a serious limitation, given that as many as half of recent meta-analyses in education 
contained fewer than 40 studies (Ahn, Ames, & Myers, 2012). To address this shortcoming, 
Tipton (2014) proposed small-sample corrections for hypothesis tests of single meta-regression 
coefficients (i.e., t-tests), which have close to nominal Type-I error even when the number of 
studies is small. 

Purpose / Objective / Research Question / Focus of Study: 

The goal of the present investigation is to develop small-sample corrections for
multiple-contrast hypothesis tests (i.e., F-tests) such as the omnibus test of meta-regression fit or a test for
equality of three or more levels of a categorical moderator. For example, studies might be
conducted on students of different ages, resulting in a covariate grade-level with three levels:
“elementary”, “middle”, or “high” school. In order to answer the question “Does the
effectiveness of this intervention vary in relation to age?” an F-test would need to be conducted.
Currently, it is not possible to conduct F-tests of this type in RVE. Lacking valid testing
methods, researchers are left to either rely on asymptotic approximations, which can be seriously
in error, or to cobble together ad-hoc methods, such as using RVE with all effect sizes to conduct
t-tests, but ANOVA with study-aggregated effect sizes to conduct F-tests.

Significance / Novelty of study: 
Drawing on work that addresses related, simpler problems and special cases of
cluster-robust variance estimation, we develop three small-sample tests based on different
approximations to the distribution of a robust Wald test statistic. These
approximations are drawn from a wide array of areas within statistics, ranging from
econometrics to survey sampling. In the remainder, we describe
our modeling assumptions, proposed tests, and some initial simulation results. The paper presents both new analytic work describing these
small-sample corrected test statistics and the results of a large simulation study that compares
these potential solutions, as well as a discussion of the implications of our findings for practice.

Statistical, Measurement, or Econometric Model: 

We develop the methods under the general meta-regression model 

$$y_i = X_i \beta + \varepsilon_i,$$

where $y_i$ is a vector of $k_i$ effect size estimates from study $i$, $X_i$ is a $k_i \times p$ matrix of covariates,
and $\varepsilon_i$ is a vector of (potentially correlated) errors with $\operatorname{Var}(\varepsilon_i) = \Sigma_i$. Importantly, the structure
of $\Sigma_i$ is typically unknown, and may involve a combination of several correlation structures.

Let $W_i$ be a $k_i \times k_i$ weighting matrix based on a “working” covariance model (see Tipton,
2014 for a discussion of how to choose the working model). The WLS estimator of $\beta$ is

$$b = M \sum_{i=1}^{m} X_i^T W_i y_i, \quad \text{where } M = \left( \sum_{i=1}^{m} X_i^T W_i X_i \right)^{-1}.$$

Note that if the working model is correct, as is typically assumed in univariate meta-analysis,
then $W_i = \Sigma_i^{-1}$ for $i = 1, \ldots, m$ and $\operatorname{Var}(b) = M$. Following Tipton (2014), we employ an
unbiased form of the robust variance estimator developed by McCaffrey, Bell, and Botts (2001),
given by

$$V^R = M \left[ \sum_{i=1}^{m} X_i^T W_i A_i e_i e_i^T A_i^T W_i X_i \right] M,$$

where $e_i = y_i - X_i b$. The $k_i \times k_i$ matrices $A_i$ are chosen such that if the weights are truly
inverse-variance (i.e., $W_i = \Sigma_i^{-1}$), then the variance estimator is exactly unbiased: $E(V^R) = M$. Note that
Tipton (2014) derives the appropriate $A_i$ for the two commonly used models in RVE -
hierarchical and correlated effects.
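
As a rough illustration of the sandwich form above, the following sketch computes $b$, $M$, and a robust variance for the simplest case: an intercept-only model ($p = 1$) with diagonal weights. It assumes $A_i = I$ for every study, that is, it omits the small-sample adjustment matrices, so it is not the unbiased estimator used in this paper. The function name and the example data are hypothetical.

```python
def rve_intercept_only(effects, weights):
    """Sandwich variance for an intercept-only meta-regression.

    effects, weights: lists of lists, one inner list per study.
    Assumes diagonal working weights and A_i = I (no small-sample
    adjustment), so this is only the uncorrected sandwich form.
    """
    # M = (sum_i X_i' W_i X_i)^{-1}; with X_i a column of ones this is
    # the reciprocal of the total weight.
    total_weight = sum(sum(w) for w in weights)
    m_scalar = 1.0 / total_weight

    # b = M * sum_i X_i' W_i y_i, i.e. the weighted mean effect size.
    b = m_scalar * sum(
        w_ij * y_ij
        for y, w in zip(effects, weights)
        for y_ij, w_ij in zip(y, w)
    )

    # "Meat" of the sandwich: squared weighted residual sum per study.
    meat = 0.0
    for y, w in zip(effects, weights):
        study_score = sum(w_ij * (y_ij - b) for y_ij, w_ij in zip(y, w))
        meat += study_score ** 2

    return b, m_scalar * meat * m_scalar


# Three hypothetical studies reporting 2, 1, and 3 effect sizes.
example_effects = [[0.2, 0.4], [0.1], [0.5, 0.3, 0.4]]
example_weights = [[1.0, 1.0], [1.0], [1.0, 1.0, 1.0]]
b_hat, v_robust = rve_intercept_only(example_effects, example_weights)
```

Residuals are summed within each study before squaring, which is what lets the estimator accommodate unknown within-study correlation.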

In this paper, we are interested in testing the null hypothesis $H_0\colon C\beta = 0$ for the $q \times p$
contrast matrix $C$. For example, an omnibus test might be written $H_0\colon \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0$, or a
test of a categorical variable with three levels might be written as a hypothesis with $q = 2$ contrasts among the corresponding coefficients. A Wald-type test
statistic for the multi-parameter hypothesis $H_0$ is given by

$$Q = b^T C^T \left( C V^R C^T \right)^{-1} C b = z^T D^{-1} z,$$

where $z = \left( C M C^T \right)^{-1/2} C b$ and $D = \left( C M C^T \right)^{-1/2} C V^R C^T \left( C M C^T \right)^{-1/2}$. As $m$ increases, the
distribution of $Q$ approaches that of a chi-squared random variate with $q$ degrees of freedom
(Wooldridge, 2002). However, the asymptotic distribution may provide a very poor
approximation when $m$ is small, leading to actual Type I error rates far in excess of the nominal
level. Furthermore, it is often unclear when one has a sufficient sample of studies to trust the
asymptotic test; as Tipton (2014) shows with t-tests, the degrees of freedom for the associated
tests depend not only on the number of studies ($m$), but also on features of the covariates.
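
The equivalence of the two forms of $Q$ can be checked directly. The short derivation below assumes that $G = C M C^T$ is symmetric and positive definite, so that a symmetric inverse square root $G^{-1/2}$ exists; the symbol $G$ is introduced here only for brevity. With $z = G^{-1/2} C b$ and $D = G^{-1/2} C V^R C^T G^{-1/2}$,

$$
\begin{aligned}
z^T D^{-1} z
&= b^T C^T G^{-1/2} \left( G^{-1/2} \, C V^R C^T \, G^{-1/2} \right)^{-1} G^{-1/2} C b \\
&= b^T C^T G^{-1/2} \, G^{1/2} \left( C V^R C^T \right)^{-1} G^{1/2} \, G^{-1/2} C b \\
&= b^T C^T \left( C V^R C^T \right)^{-1} C b = Q.
\end{aligned}
$$

When the working model is correct, $\operatorname{Var}(Cb) = G$ and $E(V^R) = M$, so $\operatorname{Var}(z) = G^{-1/2} G \, G^{-1/2} = I$ and $E(D) = G^{-1/2} G \, G^{-1/2} = I$. The standardization therefore moves all of the small-sample behavior of $Q$ into the variability of $D$ around the identity matrix.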


Methods

In this paper, we consider three adjusted tests for $H_0$ that have improved Type I error rates.
All three tests are based on quantities derived from the variances and covariances of the entries
in the matrix $D$ under a working covariance structure. Let $d_{st}$ denote the $(s, t)$ entry of $D$. Under
the working covariance model and assuming that the errors are normally distributed, we can
obtain expressions for $\operatorname{Var}(d_{st})$ and $\operatorname{Cov}(d_{st}, d_{uv})$ based on the fact that $d_{st}$ is a quadratic form.

(We omit the expressions here due to space constraints.)

Approximate Satterthwaite correction. The first test employs a Satterthwaite-type
correction, wherein we find a multiplier $\delta$ and degrees of freedom $\eta$ such that the first two
moments of $\delta Q$ approximately match those of an $F(q, \eta)$ distribution. We can show that

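$$E(Q) \approx q + T = E_Q,$$

$$\operatorname{Var}(Q) \approx 2q + 6T + 2qT + S - T^2 = V_Q,$$

where $T = \sum_{s=1}^{q} \sum_{t=1}^{q} \operatorname{Var}(d_{st})$ and $S = \sum_{s=1}^{q} \sum_{u=1}^{q} \operatorname{Cov}(d_{ss}, d_{uu})$. It follows that

$$\delta = \frac{E_Q^2 (q - 2) + 2 q V_Q}{q E_Q \left( V_Q + E_Q^2 \right)}, \qquad \eta = 4 + \frac{2 E_Q^2 (q + 2)}{q V_Q - 2 E_Q^2}.$$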
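
The moment-matching step can be sketched in a few lines. The function below takes the approximate mean and variance of $Q$ (called `e_q` and `v_q` here, both hypothetical names) and solves for the multiplier and degrees of freedom so that $\delta Q$ has the mean $\eta / (\eta - 2)$ and the variance $2\eta^2 (q + \eta - 2) / \{q (\eta - 2)^2 (\eta - 4)\}$ of an $F(q, \eta)$ variate. The input values in the example are made up for illustration and are not taken from the paper.

```python
def match_f_moments(e_q, v_q, q):
    """Return (delta, eta) such that delta * Q matches the first two
    moments of an F(q, eta) distribution, given E(Q) = e_q and
    Var(Q) = v_q.
    """
    denom = q * v_q - 2.0 * e_q ** 2
    if denom <= 0:
        # No finite eta > 4 exists; Q is no more dispersed than the
        # limiting chi-squared(q)/q reference allows.
        raise ValueError("moment matching requires q * v_q > 2 * e_q**2")
    eta = 4.0 + 2.0 * e_q ** 2 * (q + 2) / denom
    delta = (e_q ** 2 * (q - 2) + 2.0 * q * v_q) / (q * e_q * (v_q + e_q ** 2))
    return delta, eta


def f_mean_and_variance(q, eta):
    """Mean and variance of an F(q, eta) distribution (needs eta > 4)."""
    mean = eta / (eta - 2.0)
    variance = 2.0 * eta ** 2 * (q + eta - 2.0) / (q * (eta - 2.0) ** 2 * (eta - 4.0))
    return mean, variance


# Illustrative inputs only: a q = 2 contrast with E(Q) = 2.5, Var(Q) = 8.
delta, eta = match_f_moments(e_q=2.5, v_q=8.0, q=2)
```

The test then refers $\delta Q$ to the $F(q, \eta)$ distribution; solving the two moment equations always gives $\eta > 4$ whenever a solution exists, so the matched variance is finite.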


Hotelling’s $T^2$ approximation. An alternative test can be derived by approximating the distribution
of $D$ with a Wishart distribution. This approach has been considered previously for special cases
of cluster-robust variance estimation, including one-way heteroskedastic ANOVA (Zhang, 2013)
and the multivariate Behrens-Fisher problem (Krishnamoorthy & Yu, 2004), but never in the
general case. Following Zhang (2013), we derive the approximation by matching the expectation
and total variance of $D$ to those of a Wishart distribution with $\nu$ degrees of freedom.
Approximating the distribution of $D$ by a Wishart implies that $Q$ approximately follows
Hotelling’s $T^2$ distribution when $H_0$ is true. From the properties of the $T^2$ distribution, it then
follows that

$$\frac{\nu - q + 1}{\nu q} \, Q \;\dot{\sim}\; F(q, \nu - q + 1),$$

which can be used to test $H_0$. For a one-dimensional contrast ($q = 1$), the $T^2$ approximation is
exactly equivalent to the Satterthwaite approximation studied by Tipton (2014).
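
The rescaling from Hotelling's $T^2$ to an $F$ reference distribution is simple enough to sketch. The helper below (a hypothetical name) takes the Wald statistic, the dimension $q$, and the Wishart degrees of freedom $\nu$, which it assumes have already been obtained by the moment matching described above, and returns the scaled statistic with its numerator and denominator degrees of freedom.

```python
def hotelling_to_f(q_stat, q, nu):
    """Rescale a statistic with an approximate Hotelling T^2(q, nu)
    distribution so it can be referred to F(q, nu - q + 1).
    """
    df2 = nu - q + 1
    if df2 <= 0:
        raise ValueError("need nu > q - 1 for the F reference distribution")
    f_stat = df2 / (nu * q) * q_stat
    return f_stat, q, df2


# With q = 1 the scale factor is 1 and the reference is F(1, nu),
# i.e. a squared t-statistic with nu degrees of freedom.
single = hotelling_to_f(6.0, 1, 12.5)

# A three-dimensional contrast with illustrative (made-up) inputs.
multi = hotelling_to_f(9.0, 3, 20.0)
```

Because the denominator degrees of freedom are $\nu - q + 1$, the test loses a degree of freedom for each additional contrast, so a higher-dimensional hypothesis needs a larger $\nu$ to reach a given critical value.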


Spectral decomposition and transformation (SDT). Whereas the previous two tests 
sought approximations to the distribution of Q, the final test involves altering its internal 
structure, using an approach very similar to one developed by Alexander and Govern (1994) for 
heteroskedastic one-way ANOVA and by Cai and Hayes (2008) for heteroskedasticity-robust 
variance estimation (both of which are simpler cases than the cluster-robust variance estimation 
methods considered here). The SDT test entails first expressing Q as a sum of q squared t- 
variates, then applying a normalizing transformation to each variate, yielding a test statistic that 
is closer to chi-square distributed. 
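
The second half of the SDT procedure, normalizing each t-variate and summing the squares, can be sketched as follows. This abstract does not state which normalizing transformation is used; the sketch assumes Hill's (1970) approximation, the one adopted by Alexander and Govern (1994). It also takes the $q$ t-variates and their degrees of freedom as given inputs, leaving out the spectral decomposition that produces them. All function and argument names are hypothetical.

```python
import math


def hill_normalize(t, df):
    """Map a t-variate with df degrees of freedom to an approximately
    standard normal deviate (Hill's 1970 approximation). Requires df > 0.5.
    """
    a = df - 0.5
    b = 48.0 * a * a
    c = math.sqrt(a * math.log1p(t * t / df))
    z = (
        c
        + (c ** 3 + 3.0 * c) / b
        - (4.0 * c ** 7 + 33.0 * c ** 5 + 240.0 * c ** 3 + 855.0 * c)
        / (10.0 * b * b + 8.0 * b * c ** 4 + 1000.0 * b)
    )
    # The formula works with |t|; restore the sign of the input.
    return math.copysign(z, t)


def sdt_statistic(t_values, dfs):
    """Sum of squared normalized variates, to be compared with a
    chi-squared distribution on len(t_values) degrees of freedom.
    """
    return sum(hill_normalize(t, df) ** 2 for t, df in zip(t_values, dfs))


# The two-sided .05 critical value of t with 5 df maps close to 1.96.
z_check = hill_normalize(2.571, 5)
```

The transformation pulls large t-values from low-df variates toward the center, which is what brings the sum of squares closer to a chi-squared reference.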


Research Design: Simulation Study 

In order to evaluate these potential small-sample corrections, as well as to determine 
when the methods are needed (i.e., the line between “small” and “large”), we conducted a
simulation study. The simulation follows a structure similar to the second study reported in
Tipton (2014). This design included a single meta-regression model with 5 covariates - of these,
two are constant at the study level (a common feature in meta-analyses) and 3 vary at the effect
size level; additionally, one of the covariates has high leverage and another has large imbalances
(two conditions that Tipton found have large effects on performance). We simulated correlated
standardized mean difference effect sizes, as might be found in randomized experiments that
report treatment effects on multiple outcome measures. We used a diagonal weight matrix for the
working covariance model; conditions with non-zero correlation between outcomes or non-zero
between-study variability therefore represent varying degrees of model misspecification. We
considered hypothesis tests for each of the 26 possible subsets of two or more covariates,
including the omnibus test of model fit $H_0\colon \beta_1 = \cdots = \beta_5 = 0$. Table 1 summarizes the
design of the simulation. For each combination of factor levels, we simulated 5,000
meta-analyses.
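
Two pieces of arithmetic behind this design can be checked quickly: the count of multi-covariate hypotheses, and the Monte Carlo precision that 5,000 replications give for an estimated rejection rate. The covariate labels below are placeholders, and the standard-error formula is the usual binomial one, evaluated at a true rejection rate of .05.

```python
from itertools import combinations
from math import sqrt

# Placeholder labels for the five covariates in the simulated model.
covariates = ["x1", "x2", "x3", "x4", "x5"]

# Every subset containing two or more covariates defines one hypothesis test.
subsets = [
    combo
    for size in range(2, len(covariates) + 1)
    for combo in combinations(covariates, size)
]
n_tests = len(subsets)  # 10 + 10 + 5 + 1

# Binomial standard error of an estimated rejection rate when the true
# rate is .05 and each condition has 5,000 replications.
replications = 5000
mc_se = sqrt(0.05 * 0.95 / replications)
```

A Monte Carlo standard error of about .003 means that differences between tests of a percentage point or more in estimated Type I error are unlikely to be simulation noise.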

Findings / Results: Results of Simulation Study 

Due to space constraints, we describe the results only for the nominal $\alpha = .05$ level.
Furthermore, here we focus on the results in relation to the number of studies in the
meta-analysis ($m$); in the paper, we also investigate the role of the degrees of freedom. Figure 1
summarizes the range of Type-I error rates for the conventional Wald test across the various
combinations of simulation factors; each panel displays the results for the hypothesis tests of the
same dimension ($q$). The error rates of the Wald test far exceed the nominal level, particularly for
higher $q$. Even at the largest sample size considered, the Wald test has unacceptably inaccurate
error rates. Figure 2 plots the range of Type-I error rates for each of the tests described above.

All three corrected tests are more accurate than the Wald test. The $T^2$ test is generally
conservative, with error rates that seldom exceed the nominal level. The error rates of the
Satterthwaite test sometimes exceed .05, but are mostly quite accurate when $m$ is 30 or larger.

The error rates of the SDT test tend to exceed .05 and are more variable than those of the
Satterthwaite test. In the paper, we further elucidate the conditions under which the Satterthwaite
and $T^2$ tests are most appropriate, including a discussion of power.

Usefulness / Applicability of Method: 

In order to illustrate the usefulness of the method, we include an example based on a meta- 
analysis by Tanner-Smith and Lipsey (2013). This meta-analysis combined results of 
randomized experiments evaluating the effectiveness of brief alcohol interventions on subjects of
different ages (i.e., adolescents and young adults) and over multiple time points and waves. 

Conclusions: 

The results of the simulation study indicate that the asymptotic chi-squared test does not perform 
well unless the number of studies ($m$) is very large relative to the dimension of the test ($q$). In this
paper, we investigated several small-sample corrections and found two that performed best, in
terms of both Type I and II error. Finally, while this paper focuses on the RVE context, we 
expect that these same techniques will have use in other contexts, including analysis of cluster- 
randomized trials (using hierarchical linear models) and econometric analysis of panel data. 
Appendix B, Tables and Figures


Table 1. Simulation study design

Factor                                           Levels
-----------------------------------------------  ------------------------------------------------
Independent studies ($m$)                        10, 15, 20, 30, 40 - 200 (in units of 20)
Effect sizes per study ($k_1, \ldots, k_m$)      constant at 10 or varied, ranging from 1 to 10
Sample size per study                            constant at 30 or varied, ranging from 32 to 130
Correlation between the outcome measures         .0, .5, .8
Between-study variability in true effect sizes   .00, .33, .50
as a proportion of total variation ($I^2$)

Figure 2. Type-I error rates of Wald, $T^2$, Satt
for nominal $\alpha = .05$